library(tidyverse)     # for graphing and data cleaning
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.8
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.1.1     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(lubridate)     # for date manipulation
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(ggthemes)      # for even more plotting themes
library(gganimate)     # for adding animation layers to ggplots
library(RColorBrewer)  # for color palettes
library(viridis)
## Loading required package: viridisLite
library(plotly)        # for the ggplotly() - basic interactivity
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(transformr)    # for "tweening" (gganimate)
library(gifski)        # need the library for creating gifs but don't need to load each time
library(gt)
library(maps)
## 
## Attaching package: 'maps'
## The following object is masked from 'package:viridis':
## 
##     unemp
## The following object is masked from 'package:purrr':
## 
##     map
library(ggmap)
## Google's Terms of Service: https://cloud.google.com/maps-platform/terms/.
## Please cite ggmap if you use it! See citation("ggmap") for details.
## 
## Attaching package: 'ggmap'
## The following object is masked from 'package:plotly':
## 
##     wind
theme_set(theme_minimal()) # My favorite ggplot() theme :)
freq_theme_words <- read.csv("https://raw.githubusercontent.com/the-pudding/data/master/women-in-headlines/word_themes_freq.csv")
freq_country_words <- read.csv("https://raw.githubusercontent.com/the-pudding/data/master/women-in-headlines/word_country_freq.csv")
headline_site <- read.csv("https://raw.githubusercontent.com/the-pudding/data/master/women-in-headlines/headlines_site.csv")
word_theme_rank <- read.csv("https://raw.githubusercontent.com/the-pudding/data/master/women-in-headlines/word_themes_rank.csv")
headline_examples <- read.csv("https://raw.githubusercontent.com/the-pudding/data/master/women-in-headlines/headlines.csv")
polarity_site <- read.csv("https://raw.githubusercontent.com/the-pudding/data/master/women-in-headlines/polarity_comparison_site_country_time.csv")
polarity_over_time <- read.csv("https://raw.githubusercontent.com/the-pudding/data/master/women-in-headlines/polarity_comparison_country_time.csv")

The data come from news sites in four countries, with a different number of sources for each: 86 news sites from the United States of America, 41 from the United Kingdom, 23 from South Africa, and 36 from India. The data sets contain the frequency of words by theme and the frequency of words by country. The themes are crime and violence; empowerment; female stereotypes; people and places; race, ethnicity and identity; and no theme. Within each theme, words are also ranked by how often they are used. Each news site is assigned values for bias and polarity; the calculations are explained at the bottom of the article. There are also example headlines, each with its own bias score.

world_map <- map_data("world")

headline_site %>% 
  group_by(country_of_pub) %>% 
  summarise(bias_country = mean(bias)) %>% 
  ggplot() +
    geom_map(data = world_map, map = world_map,
             aes(long, lat, map_id = region),
             fill = "lightgray")+
    geom_map(map = world_map,
            aes(map_id = `country_of_pub`),
            fill = "springgreen4",
            color = "springgreen4")+
    expand_limits(x = world_map$long, y = world_map$lat) + 
    theme_map()
## Warning: Ignoring unknown aesthetics: x, y

A world map with the USA, the UK, South Africa, and India colored in green to signify where data was taken from.

The average bias of a country's news sites can sit far from the minimum and maximum bias values given to individual headlines. The following column chart displays the mean bias by country along with the maximum bias of a headline published by a site in that country. The minimum bias score is zero for every country, so it is not drawn.

headline_site %>%
  group_by(country_of_pub) %>%
  summarize(mean_bias = mean(bias), max_bias = max(bias)) %>% 
  ggplot()+
  geom_col(aes(y = country_of_pub, x = max_bias), fill = "lightblue2", width = .75)+
  geom_col(aes(y = country_of_pub, x = mean_bias), width = .5, fill = "tan2")+
  scale_x_continuous(limits = c(0, 1))+
  labs(title = "Average and Maximum Bias Score by Country",
       x = "Bias",
       y = "Country of Publication")+
  theme(plot.title = element_text(hjust = 0.5))

The following stacked bar chart shows the cumulative frequency of the words used to describe women in headlines. The words are divided into five main themes, with crime and violence having the most words and the highest total frequency. The graph is interactive: hovering over a segment shows the individual word and its frequency.

pivot_words <- freq_theme_words %>% 
  pivot_longer(cols = -theme,
               names_to = "word",
               values_to = "freq") %>% 
  na.omit()

word_plot <- pivot_words %>% 
  filter(theme != "No theme") %>% 
  ggplot(aes(x = theme, 
             y = freq, 
             fill = fct_reorder(word, freq),
             text = paste("word:", word))) +
    geom_col(color = "black") +
    theme(legend.position = "none") +
    # scale_fill_manual(values = c("darkslateblue", "lightblue2", "tan2"),
    #                   breaks = waiver())+
    labs(title = "Cumulative Frequency of Words describing Women in Headlines",
       x = "",
       y = "Frequency")+
    theme(plot.title = element_text(hjust = 0.5))


ggplotly(word_plot,
         tooltip = c("y", "text"))

The words taken from headlines across different news sites were sorted into themes and ranked by occurrence. The following column chart shows the top five words in each theme; the word ‘man’ appears almost three times as often as the average word. Crime and violence has the highest average word count of any theme.

word_theme_rank %>% 
  filter(`rank` < 6) %>% 
  select(!`X`) %>% 
  ggplot(aes(y = fct_reorder(word, theme), x = count)) +
  geom_col(aes(fill = theme))+
  scale_fill_viridis_d(option = "viridis") +
  #theme(legend.position = "none")+
  theme(plot.title = element_text(hjust = 0.5))+
  labs(title = "Count of Top 5 words per Theme",
       y = "",
       x = "")

Over the past ten years, the average polarity of news headlines about women has been higher than that of headlines about other topics. Polarity measures how strongly positive or negative a headline's language is, regardless of direction.

polarity_over_time %>% 
  group_by(`year`) %>% 
  summarise(women_mean = mean(`women_polarity_mean`),
            all_mean = mean(`all_polarity_mean`),
            year) %>% 
  ggplot()+
  geom_smooth(aes(x=`year`, y=`women_mean`), color = "springgreen4", se = FALSE)+
  geom_smooth(aes(x=`year`, y=`all_mean`), color = "black", se = FALSE)+
  geom_point(aes(x=2020.0, y=0.425), 
             color = "black", fill = "springgreen4", 
             size = 5, stroke = 2, shape = 21) +
  geom_point(aes(x=2020.0, y=0.28), size = 2.5)+
  geom_label(label = "Headlines about\nwomen", x= 2019.4, y=0.40, color = "springgreen4")+
  geom_label(label = "Headlines about\nother topics", x=2019.4, y= 0.25)+
  scale_x_continuous(breaks = c(2010, 2012, 2014, 2016, 2018, 2020))+
  labs(title = "Average of Polarity of News Headlines over Time",
       y = "",
       x = "")+
  theme(plot.title = element_text(hjust = 0.5),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank(),
        panel.grid.major.y = element_blank(),
        panel.grid.minor.y = element_blank(),
        axis.line.x = element_line(color = "black"))
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

The following chart compares polarity by site: each line segment runs from a site's base polarity (black) to its polarity for headlines about women (green). Sites are ordered by the polarity of headlines about women, not by the size of the difference.

polarity_site %>% 
  ggplot()+
  geom_segment(aes(x=polarity_base, xend=polarity_women, y=fct_reorder(site, polarity_women), yend=site), size = 1)+
  geom_point(aes(x=polarity_base, y = site), size = 2)+
  geom_point(aes(x=polarity_women, y = site), color = "black", fill = "springgreen4", 
             size = 3, stroke = 1, shape = 21)+
  labs(title = "Polarity of News Outlets:\n Headlines about Women vs. Headlines about other topics",
       y = "",
       x = "Polarity")+
  theme(plot.title = element_text(hjust = 0.5))

Three examples of headlines with no calculated bias

last_three_headlines <- headline_examples %>% 
  rename("Headline" = `headline_no_site`,
         "Site" = `site`,
         "Country" = `country`,
         "Bias" = `bias`) %>%
  arrange(`Bias`) %>%
  distinct(Site, .keep_all = TRUE) %>% 
  slice(1:3) %>% 
  select(`Headline`, `Site`, `Country`, `Bias`)

last_three_headlines_table <- gt(last_three_headlines) %>% 
  tab_header(title = "Least Biased Headline Examples") %>% 
  data_color(columns = c(`Headline`, `Site`, `Country`, `Bias`), 
             colors = '#bccae0')
last_three_headlines_table
Least Biased Headline Examples

| Headline | Site | Country | Bias |
|---|---|---|---|
| 'Lady Bird' buzzes through young sexuality | iol.co.za | South Africa | 0 |
| American Woman, Divorced From Saudi Husband, Is Trapped in Saudi Arabia | msn.com | India | 0 |
| 'SA poorer without her' SACP reacts to Madikizela Mandela's death | News24.com | South Africa | 0 |

Three examples of highly biased headlines

top_three_headlines <- headline_examples %>% 
  rename("Headline" = `headline_no_site`,
         "Site" = `site`,
         "Country" = `country`) %>% 
  mutate(Bias = round(bias, digits = 3)) %>% 
  arrange(desc(`Bias`)) %>%
  distinct(Site, .keep_all = TRUE) %>% 
  slice(1:3) %>% 
  select(`Headline`, `Site`, `Country`, `Bias`)

top_three_headlines_table <- gt(top_three_headlines) %>% 
  tab_header(title = "Most Biased Headline Examples") %>% 
  data_color(columns = c(`Headline`, `Site`, `Country`, `Bias`), 
             colors = '#bccae0')
top_three_headlines_table
Most Biased Headline Examples

| Headline | Site | Country | Bias |
|---|---|---|---|
| Girl with severe eczema told her mum she 'didn't want to look at herself in the mirror' she's now a model | manchestereveningnews.co.uk | UK | 1.000 |
| A Mother Said Her 9 Year Old Daughter Killed Herself Because She Was Bullied For Being Friends With A White Boy | buzzfeed.com | UK | 0.833 |
| Wuthering Heights actress Merle Oberon's secret that she took to the grave... her sister was her mother who ga | dailymail.co.uk | India | 0.833 |

POLARITY CALCULATIONS We measure polarity by performing sentiment analysis on each headline using the VADER Python package, where each headline gets a sentiment score from -1 to 1 (from most negative to most positive). Because we are interested in polarity, we take the absolute value of each headline’s score.

BIAS CALCULATIONS We measure gender bias by tracking the combined occurrence of gendered language and social stereotypes usually associated with women. We do this in two steps: 1) We check if a headline contains gendered language (e.g. “spokeswoman,” “chairwoman,” “she,” “her,” “bride,” “daughter,” “daughters,” “female,” “fiancee,” “girl,” “girlfriend”). 2) If it contains gendered language, we then count the number of words that are considered to be social stereotypes about women (e.g. “weak,” “modest,” “virgin,” “slut,” “whore,” “sexy,” “feminine,” “sensitive,” “emotional,” “gentle,” “soft,” “pretty,” “bitch,” “sexual”). Finally, we normalize this count for all headlines within each outlet as a score between 0 and 1, and we aggregate (i.e. average) this score for each outlet. (Source: The Pudding, https://pudding.cool/2022/02/women-in-headlines/)

---
title: "Headlines"
author: "Audrey Smyczek"
date: "4/14/2022"
output: 
  html_document:
    df_print: paged
    code_download: true
---

```{r setup, include=FALSE}
#knitr::opts_chunk$set(echo = TRUE)
```

```{r libraries}
library(tidyverse)     # for graphing and data cleaning
library(lubridate)     # for date manipulation
library(ggthemes)      # for even more plotting themes
library(gganimate)     # for adding animation layers to ggplots
library(RColorBrewer)  # for color palettes
library(viridis)
library(plotly)        # for the ggplotly() - basic interactivity
library(transformr)    # for "tweening" (gganimate)
library(gifski)        # need the library for creating gifs but don't need to load each time
library(gt)
library(maps)
library(ggmap)
theme_set(theme_minimal()) # My favorite ggplot() theme :)
```

```{r}
freq_theme_words <- read.csv("https://raw.githubusercontent.com/the-pudding/data/master/women-in-headlines/word_themes_freq.csv")
freq_country_words <- read.csv("https://raw.githubusercontent.com/the-pudding/data/master/women-in-headlines/word_country_freq.csv")
headline_site <- read.csv("https://raw.githubusercontent.com/the-pudding/data/master/women-in-headlines/headlines_site.csv")
word_theme_rank <- read.csv("https://raw.githubusercontent.com/the-pudding/data/master/women-in-headlines/word_themes_rank.csv")
headline_examples <- read.csv("https://raw.githubusercontent.com/the-pudding/data/master/women-in-headlines/headlines.csv")
polarity_site <- read.csv("https://raw.githubusercontent.com/the-pudding/data/master/women-in-headlines/polarity_comparison_site_country_time.csv")
polarity_over_time <- read.csv("https://raw.githubusercontent.com/the-pudding/data/master/women-in-headlines/polarity_comparison_country_time.csv")
```


```{r, echo = FALSE}
pivot_country_word <- freq_country_words %>% 
  pivot_longer(cols = -country,
               names_to = "word",
               values_to = "number") %>% 
  filter(word != "X") %>% 
  na.omit()
```



The data come from news sites in four countries, with a different number of sources for each: 86 news sites from the United States of America, 41 from the United Kingdom, 23 from South Africa, and 36 from India. The data sets contain the frequency of words by theme and the frequency of words by country. The themes are crime and violence; empowerment; female stereotypes; people and places; race, ethnicity and identity; and no theme. Within each theme, words are also ranked by how often they are used. Each news site is assigned values for bias and polarity; the calculations are explained at the bottom of the article. There are also example headlines, each with its own bias score.

```{r, fig.alt= "A world map with the USA, the UK, South Africa, and India colored in green to signify where data was taken from."}
world_map <- map_data("world")

headline_site %>% 
  group_by(country_of_pub) %>% 
  summarise(bias_country = mean(bias)) %>% 
  ggplot() +
    geom_map(data = world_map, map = world_map,
             aes(long, lat, map_id = region),
             fill = "lightgray")+
    geom_map(map = world_map,
            aes(map_id = `country_of_pub`),
            fill = "springgreen4",
            color = "springgreen4")+
    expand_limits(x = world_map$long, y = world_map$lat) + 
    theme_map()
```


The average bias of a country's news sites can sit far from the minimum and maximum bias values given to individual headlines. The following column chart displays the mean bias by country along with the maximum bias of a headline published by a site in that country. The minimum bias score is zero for every country, so it is not drawn.

```{r}
headline_site %>%
  group_by(country_of_pub) %>%
  summarize(mean_bias = mean(bias), max_bias = max(bias)) %>% 
  ggplot()+
  geom_col(aes(y = country_of_pub, x = max_bias), fill = "lightblue2", width = .75)+
  geom_col(aes(y = country_of_pub, x = mean_bias), width = .5, fill = "tan2")+
  scale_x_continuous(limits = c(0, 1))+
  labs(title = "Average and Maximum Bias Score by Country",
       x = "Bias",
       y = "Country of Publication")+
  theme(plot.title = element_text(hjust = 0.5))
```


The following stacked bar chart shows the cumulative frequency of the words used to describe women in headlines. The words are divided into five main themes, with crime and violence having the most words and the highest total frequency. The graph is interactive: hovering over a segment shows the individual word and its frequency.

```{r}
pivot_words <- freq_theme_words %>% 
  pivot_longer(cols = -theme,
               names_to = "word",
               values_to = "freq") %>% 
  na.omit()

word_plot <- pivot_words %>% 
  filter(theme != "No theme") %>% 
  ggplot(aes(x = theme, 
             y = freq, 
             fill = fct_reorder(word, freq),
             text = paste("word:", word))) +
    geom_col(color = "black") +
    theme(legend.position = "none") +
    # scale_fill_manual(values = c("darkslateblue", "lightblue2", "tan2"),
    #                   breaks = waiver())+
    labs(title = "Cumulative Frequency of Words describing Women in Headlines",
       x = "",
       y = "Frequency")+
    theme(plot.title = element_text(hjust = 0.5))


ggplotly(word_plot,
         tooltip = c("y", "text"))
```


The words taken from headlines across different news sites were sorted into themes and ranked by occurrence. The following column chart shows the top five words in each theme; the word 'man' appears almost three times as often as the average word. Crime and violence has the highest average word count of any theme.

```{r}
word_theme_rank %>% 
  filter(`rank` < 6) %>% 
  select(!`X`) %>% 
  ggplot(aes(y = fct_reorder(word, theme), x = count)) +
  geom_col(aes(fill = theme))+
  scale_fill_viridis_d(option = "viridis") +
  #theme(legend.position = "none")+
  theme(plot.title = element_text(hjust = 0.5))+
  labs(title = "Count of Top 5 words per Theme",
       y = "",
       x = "")
```


Over the past ten years, the average polarity of news headlines about women has been higher than that of headlines about other topics. Polarity measures how strongly positive or negative a headline's language is, regardless of direction.

```{r}
polarity_over_time %>% 
  group_by(`year`) %>% 
  summarise(women_mean = mean(`women_polarity_mean`),
            all_mean = mean(`all_polarity_mean`)) %>% 
  ggplot()+
  geom_smooth(aes(x=`year`, y=`women_mean`), color = "springgreen4", se = FALSE)+
  geom_smooth(aes(x=`year`, y=`all_mean`), color = "black", se = FALSE)+
  geom_point(aes(x=2020.0, y=0.425), 
             color = "black", fill = "springgreen4", 
             size = 5, stroke = 2, shape = 21) +
  geom_point(aes(x=2020.0, y=0.28), size = 2.5)+
  geom_label(label = "Headlines about\nwomen", x= 2019.4, y=0.40, color = "springgreen4")+
  geom_label(label = "Headlines about\nother topics", x=2019.4, y= 0.25)+
  scale_x_continuous(breaks = c(2010, 2012, 2014, 2016, 2018, 2020))+
  labs(title = "Average of Polarity of News Headlines over Time",
       y = "",
       x = "")+
  theme(plot.title = element_text(hjust = 0.5),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank(),
        panel.grid.major.y = element_blank(),
        panel.grid.minor.y = element_blank(),
        axis.line.x = element_line(color = "black"))
```



The following chart compares polarity by site: each line segment runs from a site's base polarity (black) to its polarity for headlines about women (green). Sites are ordered by the polarity of headlines about women, not by the size of the difference.

```{r, fig.height= 24, fig.width= 8}
polarity_site %>% 
  ggplot()+
  geom_segment(aes(x=polarity_base, xend=polarity_women, y=fct_reorder(site, polarity_women), yend=site), size = 1)+
  geom_point(aes(x=polarity_base, y = site), size = 2)+
  geom_point(aes(x=polarity_women, y = site), color = "black", fill = "springgreen4", 
             size = 3, stroke = 1, shape = 21)+
  labs(title = "Polarity of News Outlets:\n Headlines about Women vs. Headlines about other topics",
       y = "",
       x = "Polarity")+
  theme(plot.title = element_text(hjust = 0.5))
```
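
The chart above orders sites by their polarity for headlines about women. As a rough sketch of how the two possible orderings differ, the example below uses base R's `reorder()` on made-up values; the site names and numbers are hypothetical and are not taken from the data.

```r
# Made-up polarity values for three hypothetical sites.
sites <- data.frame(
  site = c("a.example", "b.example", "c.example"),
  polarity_base  = c(0.30, 0.20, 0.35),
  polarity_women = c(0.45, 0.40, 0.38)
)

# Ordering used in the chart: by polarity of headlines about women.
by_women <- reorder(sites$site, sites$polarity_women)
levels(by_women)

# Alternative ordering: by the gap between the two polarity values.
by_gap <- reorder(sites$site, sites$polarity_women - sites$polarity_base)
levels(by_gap)
```

With these values the two orderings disagree: the site with the highest polarity for headlines about women is not the one with the largest gap. In the plotting chunk, the same idea would presumably amount to passing `polarity_women - polarity_base` as the second argument of `fct_reorder()`.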



Three examples of headlines with no calculated bias

```{r}
last_three_headlines <- headline_examples %>% 
  rename("Headline" = `headline_no_site`,
         "Site" = `site`,
         "Country" = `country`,
         "Bias" = `bias`) %>%
  arrange(`Bias`) %>%
  distinct(Site, .keep_all = TRUE) %>% 
  slice(1:3) %>% 
  select(`Headline`, `Site`, `Country`, `Bias`)

last_three_headlines_table <- gt(last_three_headlines) %>% 
  tab_header(title = "Least Biased Headline Examples") %>% 
  data_color(columns = c(`Headline`, `Site`, `Country`, `Bias`), 
             colors = '#bccae0')

last_three_headlines_table
```


Three examples of highly biased headlines

```{r}
top_three_headlines <- headline_examples %>% 
  rename("Headline" = `headline_no_site`,
         "Site" = `site`,
         "Country" = `country`) %>% 
  mutate(Bias = round(bias, digits = 3)) %>% 
  arrange(desc(`Bias`)) %>%
  distinct(Site, .keep_all = TRUE) %>% 
  slice(1:3) %>% 
  select(`Headline`, `Site`, `Country`, `Bias`)

top_three_headlines_table <- gt(top_three_headlines) %>% 
  tab_header(title = "Most Biased Headline Examples") %>% 
  data_color(columns = c(`Headline`, `Site`, `Country`, `Bias`), 
             colors = '#bccae0')

top_three_headlines_table
```



POLARITY CALCULATIONS

We measure polarity by performing sentiment analysis on each headline using the VADER Python package, where each headline gets a sentiment score from -1 to 1 (from most negative to most positive). Because we are interested in polarity, we take the absolute value of each headline's score.
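
As a minimal sketch of the last step, assume the sentiment scores have already been computed elsewhere. The `compound` values and the `headline_polarity()` helper below are made up for illustration and are not part of the original analysis.

```r
# Hypothetical sentiment scores in [-1, 1], one per headline.
compound <- c(-0.80, 0.10, 0.65, 0.00)

# Polarity ignores the direction of the sentiment and keeps only its strength.
headline_polarity <- function(score) {
  stopifnot(all(score >= -1 & score <= 1))
  abs(score)
}

polarity <- headline_polarity(compound)
mean(polarity)  # average polarity across these four headlines
```

A strongly negative headline and a strongly positive one therefore both count as highly polarized, while a neutral headline scores close to zero.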

BIAS CALCULATIONS

We measure gender bias by tracking the combined occurrence of gendered language and social stereotypes usually associated with women. We do this in two steps:

1) We check if a headline contains gendered language (e.g. “spokeswoman,” “chairwoman,” “she,” “her,” “bride,” “daughter,” “daughters,” “female,” “fiancee,” “girl,” “girlfriend”).
2) If it contains gendered language, we then count the number of words that are considered to be social stereotypes about women (e.g. “weak,” “modest,” “virgin,” “slut,” “whore,” “sexy,” “feminine,” “sensitive,” “emotional,” “gentle,” “soft,” “pretty,” “bitch,” “sexual”).

Finally, we normalize this count for all headlines within each outlet as a score between 0 and 1, and we aggregate (i.e. average) this score for each outlet.

(Source: The Pudding, https://pudding.cool/2022/02/women-in-headlines/)
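
The steps above can be sketched in base R as follows. This is only an illustration under stated assumptions: the headlines and site names are made up, the word lists are short excerpts of the ones quoted above, and the normalization (dividing by the largest count within each outlet) is a guess, since the exact formula is not given.

```r
# Short excerpts of the word lists quoted above; the full lists are longer.
gendered_words   <- c("spokeswoman", "chairwoman", "she", "her", "bride",
                      "daughter", "daughters", "female", "girl", "girlfriend")
stereotype_words <- c("weak", "modest", "sexy", "feminine", "sensitive",
                      "emotional", "gentle", "soft", "pretty")

# Made-up headlines and outlets, for illustration only.
headlines <- data.frame(
  site = c("a.example", "a.example", "a.example",
           "b.example", "b.example", "b.example"),
  headline = c("Her gentle and pretty smile wins hearts",
               "Girl called too emotional for the job",
               "Markets rally after rate cut",
               "She leads the company to record profits",
               "Pretty bride steals the show",
               "Council approves new budget"),
  stringsAsFactors = FALSE
)

# Split a headline into lowercase words, dropping punctuation.
tokenize <- function(text) {
  words <- strsplit(tolower(text), "[^a-z']+")[[1]]
  words[nzchar(words)]
}

# Steps 1 and 2: count stereotype words, but only if the headline
# contains gendered language; otherwise the count is zero.
stereotype_count <- function(text) {
  words <- tokenize(text)
  if (!any(words %in% gendered_words)) return(0)
  as.numeric(sum(words %in% stereotype_words))
}

headlines$count <- vapply(headlines$headline, stereotype_count, numeric(1),
                          USE.NAMES = FALSE)

# ASSUMPTION: normalize by the largest count within each outlet, so scores
# fall between 0 and 1. The source does not specify the exact formula.
normalize <- function(x) if (max(x) > 0) x / max(x) else x
headlines$bias <- ave(headlines$count, headlines$site, FUN = normalize)

# Aggregate: average headline score per outlet.
site_bias <- aggregate(bias ~ site, data = headlines, FUN = mean)
site_bias
```

Note that a headline with stereotype words but no gendered language (none appear in this toy data) would still score zero, because the gendered-language check in step 1 acts as a gate for step 2.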


